Unveiling the potential of machine learning approaches in predicting the emergence of stroke at its onset: a predicting framework

A stroke is a dangerous, life-threatening disease that mostly affects people over 65, although an unhealthy diet is also contributing to strokes at younger ages. Strokes can be treated successfully if they are identified early enough and suitable therapies are available. The purpose of this study is to develop a stroke prediction model that improves both the effectiveness and the accuracy of stroke prediction. The proposed machine learning approach predicts whether or not a person is suffering from a stroke. In this research, various machine learning techniques are evaluated for predicting stroke on the healthcare stroke dataset. The feature selection algorithms used are gradient boosting and random forest, and the classifiers include the decision tree, Support Vector Machine (SVM), logistic regression, gradient boosting, random forest, K-neighbors, and Extreme gradient boosting (XGBoost) classifiers. Different machine learning approaches are employed to test predictive methods on different data samples. In the comparison of the different classification models, the random forest model offers an accuracy rate of 98%.


Related work
This section covers the literature that preceded the present study. Clinical data processing and current medical diagnostics have benefited from the computational dependability of AI. AI can succeed where human vision fails.
Nowadays, many systems used in the medical research sector incorporate a variety of machine learning approaches for data processing and new discoveries. A number of recent studies in the field of healthcare have been analyzed using machine learning techniques. In numerous prior studies, researchers have leveraged machine learning techniques to predict and forecast outcomes associated with strokes. The results of various scholars in this field are summarized below.
Minhaz et al. 8 gathered information on strokes from many hospitals in Bangladesh. The data is cleaned and prepared before being used in the training phase, where 10 different algorithms are used. They then employ a weighted voting classifier over all classifiers to improve the results. The best model for classification is then selected; next, only that model is optimized after each succeeding model has been optimized using a weighted voting classifier. The study's findings indicate that the weighted voting classifier can achieve an accuracy of up to 97%.
Research by Yoon-A et al. 9 demonstrated the ability to recognize a stroke using biometric signals from an electroencephalogram (EEG) recorded while a person is walking. According to the study's authors, random forest can efficiently detect strokes using biometric information. In order to predict strokes, Priya et al. 10 use machine learning methods and text-mining tools. They use 14 different techniques to classify the data, including a complex tree, a simple tree, a medium tree, a linear SVM, a quadratic SVM, an ANN, and logistic regression. According to experimental results, ANN has a higher accuracy than the other methods, at 95.3%.
Sailasya et al. 11 used Kaggle data and applied preprocessing that addressed problems like handling missing values, label encoding, and imbalanced data. The dataset is cleaned and prepared before being analyzed using six distinct machine learning methods. Among the algorithms compared, Naive Bayes classification has the highest accuracy (82%). They also created an HTML webpage where users can enter their information to check for stroke on their own. To predict strokes, Hager et al. 12 used four classification methods: logistic regression, support vector machine (SVM), random forest, and decision tree. The machine learning algorithms were trained with hyperparameter tuning and cross-validation. In a comparison of the results of these four models, random forest came out on top with a 90% accuracy rate.
Using imbalanced data, Wu et al. 13 constructed a model to predict stroke. To balance the unbalanced data that they had obtained from the Chinese Longitudinal Healthy Longevity Study, the researchers used RUS, ROS, and SMOTE techniques. The authors used regularized random forest, Support Vector Machine, and logistic regression for predicting strokes in both balanced and imbalanced datasets, contrasting the best results from each model to those from the individual datasets. They found that RLR and SVM had the highest accuracy (95%) but the lowest sensitivity (0.1) on the unbalanced dataset.
An investigation conducted by Badriyah et al. 14 gathered data from CT scans of stroke patients in Surabaya, Indonesia. Images are pre-processed using methods like cropping, grayscale conversion, data conversion, data augmentation, and scaling with the goal of improving the image quality. Features are then extracted from the image data. The efficacy of eight machine learning algorithms is then evaluated, including logistic regression, Naive Bayes, random forest, and decision tree. By achieving a 95.97% accuracy rate, this experiment demonstrates that random forest outperforms all other classifiers. For the purpose of predicting stroke risk, Jaehak et al. 15 studied real-time biosignals combined with artificial intelligence. With both Long Short-Term Memory (LSTM, deep learning) and random forest (machine learning) algorithms being used, LSTM attains the highest accuracy in this configuration. Using EEG data and additional deep learning models, Yoon-A et al. 16 published another study in which LSTM produced the best results, with the highest accuracy (94%).
Singh et al. 17 developed an algorithm for stroke prediction and compared it to a variety of other methods on the Cardiovascular Health Study (CHS) dataset. A decision tree was used for feature selection, dimensions were reduced using PCA, and the classification model was constructed using a backpropagation neural network. After analyzing and merging classification efficiencies using several approaches and heterogeneity models, 97.7% accuracy was found in the prediction of stroke disease. Liu et al. 18 developed a hybrid machine learning method for cerebral stroke prediction from class-imbalanced and incomplete physiological data. The procedure consisted of two steps. First, missing data were filled in using random forest regression prior to classification. Second, stroke predictions were made on the imbalanced dataset using a DNN-based automatic hyperparameter optimization (AutoHPO). There are 43,400 patient records in the medical dataset, and 783 of them are stroke instances. The false negative rate for this prediction strategy was 19.1%, which is lower than the average for more conventional techniques by about 51.5%. A 33.1% false positive rate, a 71.6% accuracy, and a 66.4% sensitivity were estimated for the suggested technique.
Asadi et al. 19 carried out a thorough review of patient files to determine whether endovascular therapy was beneficial for acute ischemic stroke. Conventional statistical analysis and ANN analysis were performed using SPSS, MATLAB, and RapidMiner. A supervised method was developed for identifying good and bad predictors using support vector machine algorithms. These algorithms were trained, tested, and ranked using randomly divided data. Endovascular therapy for acute anterior circulation ischemic stroke was performed on 107 patients. Of these, 66 were male, and the average age was 65. The models took into account every conceivable clinical, economic, and practical factor. In the resulting confusion matrix, the target and output groups of the neural network were approximately 80% congruent, and the receiver operating characteristics were favorable. The performance of the support vector machine was enhanced through optimization, reaching a respectable root mean square error of 2.064. Heo et al. 20 conducted recurrent studies on people who had an acute ischemic stroke. After three months, an outcome was considered successful if the modified Rankin Scale score was 0, 1, or 2. Random forest, logistic regression, and DNN were developed and contrasted as the three prediction techniques.
Letham et al. 21 described Bayesian Rule Lists, which generate a posterior distribution across possible decision lists. The results showed that predictions made using Bayesian Rule Lists were generally accurate. In the context of recent advances in precision medicine, the approach could provide exceptionally accurate and understandable patient scoring systems (such as the CHADS2 score), putting it on par with the most sophisticated machine learning prediction algorithms. Emon et al. 8 employed a number of machine learning techniques to categorize stroke patients, including Stochastic Gradient Descent, logistic regression, KNN, decision trees, XGBoost, gradient boosting, and AdaBoost. The weighted voting classifier in that study had a 97% accuracy rate. Amini et al. 22 conducted studies to forecast strokes. They examined 50 risk factors for stroke, including alcohol use, smoking, hyperlipidemia, diabetes, and other conditions, among 807 healthy and ill individuals. The k-nearest neighbor algorithm and a C4.5 decision tree algorithm were employed; these algorithms have accuracy rates of 94% and 95%, respectively. Sirsat et al. 23 investigate the use of machine learning in relation to brain strokes. The authors comprehensively review existing literature to identify trends, methodologies, and key findings in the utilization of machine learning techniques for stroke prediction, detection, and diagnosis. Through a thorough examination of relevant articles, the review aims to provide insights into the current state of research, addressing the advantages and disadvantages of various methods. This survey provides a useful reference for researchers and practitioners within the discipline, offering a consolidated overview of advancements, challenges, and potential areas for future contributions in leveraging machine learning for brain strokes. According to Cheng et al. 24, the prognosis of an ischemic stroke can be estimated. Two ANN models were used in the study; data sets from 82 ischemic stroke patients were analyzed, and accuracies of 95% and 79% were obtained, respectively. An investigation by Cheon et al. 25 examined whether the death of stroke patients could be predicted. The occurrence of strokes was calculated in their study based on 15,099 participants. In order to identify strokes, they used deep neural network methods.
A PCA model was utilized by the authors to extract data from stroke medical records and predict stroke incidence. In their case, the area under the curve is 83%. A method of lengthening. Their CNN approach had an accuracy rate of 90%. The Sung et al. 26 study led to the development of a stroke severity index. The study by Chin et al. 27 examined the effectiveness of an automated technique for early ischemic stroke detection. Using convolutional neural networks (CNNs), they developed a method to automate the detection of primary ischemic stroke. The researchers gathered 256 images in order to create and test the CNN model. They used the information gathered on 3577 victims of acute ischemic stroke to expand the set of images gathered for their system's image preprocessing. In addition to linear regression, they used a number of data mining techniques to build their predictive models. Based on 95% confidence intervals, they were more accurate than the K-nearest neighbor algorithm.
Monteiro et al. 28 used machine learning to predict functional outcomes after an ischemic stroke. A patient who passed away three months after being admitted was used as a test case for this procedure. The AUC value of their study was higher than 90%. The study conducted by Kansadub et al. aimed to determine the risk of stroke. To analyze the data and forecast strokes, the study's authors employed Naive Bayes, neural networks, and decision trees. In their investigation, the authors examined the AUC and accuracy of their models. Their results indicated that the most accurate algorithm was Naive Bayes, which was classified as a decision tree. The research article by Kansadub et al. 29 focuses on developing a stroke risk prediction model from demographic data. The authors conducted a literature survey to explore existing studies and methodologies related to stroke risk prediction. They likely reviewed relevant publications that pertain to biomedical engineering, epidemiology, and machine learning in order to understand the current landscape of stroke prediction models. The survey may have covered aspects such as data sources, demographic variables, modeling techniques, and evaluation metrics used in previous works. By synthesizing information from the literature, the authors aimed to identify gaps, challenges, and opportunities in the existing approaches, providing a foundation for their own research. Research by Adam et al. 30 was conducted to determine the classification of ischemic stroke. To categorize ischemic strokes, the researchers used decision trees and the k-nearest neighbor method. According to healthcare providers, decision trees are more effective at classifying strokes.
In a study by Jafar et al. 31, the proposed method, which enhanced the model's performance and decreased overfitting, identified the top 10 most crucial features by combining LASSO feature selection (FS) with HyperOpt optimization of the gradient boosting (GB) model. As a result, the model was easier to understand, and heart disease risk assessment and early detection were enhanced. An automated heart disease prediction system, HypGB, was presented in that work. The system outperformed all other existing machine learning models, achieving 97.32% accuracy on the Cleveland heart disease dataset and 97.72% accuracy on the Kaggle heart failure dataset. Mariano et al. 32 introduced an efficient and rapid method for generating extensive datasets tailored for machine learning applications in the classification of brain strokes using microwave imaging equipment. Their suggested method relies on the linearization of the scattering operator and the distorted Born approximation, which speeds up the creation of the large training datasets that machine learning algorithms require. This approach aims to minimize the effort required in creating the dataset. Each classifier within the system can determine the occurrence of a stroke, classify its type (ischemic or hemorrhagic), and pinpoint its location in the brain. The effectiveness of the trained procedures was assessed using data sets produced by full-wave simulations of the entire system.
In a recent exploration by Srinivas et al. 33, the suggested soft voting classifier uses three base classifiers: random forest, Extremely Randomized Trees, and Histogram-Based gradient boosting. Histogram-Based gradient boosting (HBGB) is a variation of the gradient boosting algorithm that substitutes the conventional method of using a single decision tree with histograms to approximate the underlying distribution of the data. It works especially well with datasets that have a large number of features. With the soft voting classifier, the overall F1 score is 97% and the accuracy achieved with the model is 96.88%.
Maheswara Rao et al. 34 propose ML-IHDAS, which uses specific features to create classification models with Support Vector Machine (SVM), random forest (RF), gradient boosted decision trees (XGBoost), and logistic regression (LR). According to the experimental study, XGBoost used minimal features to achieve high accuracy, while MRMR performed better for training machine learning models than Kruskal, Chi2, and ANOVA. MRMR performs better than the other methods at selecting features for training all deployed machine learning models, and it improves the effectiveness of the XGBoost approach.
A comparison of all the related work is provided in Table 1 above. For the vast majority of investigations, an accuracy rate of 90% was considered favorable. The unique aspect of our research is the use of a number of well-known machine learning techniques to achieve our desired results. With corresponding F1-scores of 94, 87, 96, and 91%, random forest (RF), decision tree (DT), logistic regression (LR), and voting classifier (VC) were the most effective methods. The significantly higher accuracy compared with earlier studies demonstrates the reliability of the models used in this study. They have been shown to be reliable in several model comparisons, and the analysis results may be used to create the scheme.
It is clear from this section that several machine learning techniques, text-mining tools, and biometric signals have been used to achieve high accuracy. However, the outcomes of these methods and approaches are inconsistent, which is perplexing. For this reason, a simple method based on machine learning is employed here, which has been capable of producing a high accuracy rate in comparison to the earlier work. To further enhance the accuracy of our prediction model, we additionally used feature selection algorithms to extract the most essential features that contribute to the accuracy of the model.

Materials and methods
In this part, we discuss our proposed methodology. The system functions as shown in Fig. 1. The sections that follow provide a comprehensive overview of the major steps of the technique.

The stroke disease prediction system
The structure of the stroke disease prediction system is shown in Fig. 1.

Data acquisition
In our study, we utilized stroke data from the Health Care dataset 35 to construct our dataset. Given the rarity of stroke data, sourcing from reliable datasets is imperative for robust model training and accurate forecasting. This dataset is used to predict the risk of stroke in a patient based on variables such as age, gender, and the presence or absence of specific diseases and smoking. A part of the original training data is chosen using a filtering technique for applications such as machine learning and data visualization.

Data preprocessing
Data preprocessing is necessary before constructing the model in order to eliminate outliers and noise that might lead to invalid training results 36. This step addresses anything that prevents the model from functioning more efficiently. We used the following data preprocessing techniques in this article.
(1) Handling Null Values: Missing data occur when no value is provided for a single item or a whole unit. In real-world data, this is a significant issue. We first determined whether there were any missing values in the data set using isnull(). Imputation methods replace or fill in missing values with estimated or calculated values; common methods include mean imputation, median imputation, and forward or backward filling. These methods are applied after identifying missing values using isnull(). If any missing values are discovered, the missing value of the attribute is replaced with the median value of that specific column. These approaches were essential in maintaining data completeness and avoiding bias in our subsequent analyses.
(2) Label Encoding: The data set contains some string information, so in order for the string values to fit in the model, they must be converted to numeric values. To achieve this, we applied label encoding to the dataset.
Figure 2 below shows the data preprocessing overview of the given dataset.
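The two preprocessing steps above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the study's actual code: the text mentions isnull(), which suggests pandas was used, whereas this sketch uses only the Python standard library. The helper names `impute_median` and `label_encode`, and the toy `bmi` and `gender` columns, are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each distinct string to an integer code, assigned in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical toy columns, for illustration only.
bmi = [28.1, None, 32.5, 24.0, None]
gender = ["Male", "Female", "Female", "Male", "Other"]

bmi_filled = impute_median(bmi)
gender_codes, gender_map = label_encode(gender)

print(bmi_filled)
print(gender_codes, gender_map)
```

Median imputation is less sensitive to extreme values than mean imputation, which is one common reason to prefer it for skewed clinical measurements.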
Figure 3 illustrates the unbalanced dataset used to predict strokes. The dataset contains 5110 rows, 249 of which indicate the possibility of a stroke and 4861 of which indicate its absence. Although such data can be used to train a machine learning model, other performance metrics, including precision and recall, are then not adequate. It is essential to treat imbalanced data carefully to avoid inaccurate conclusions and useless forecasts. An efficient model can only be obtained by addressing the imbalance first. For this, the oversampling technique was used. Figure 4 shows the balanced output column from the dataset.
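The text does not state which oversampling variant was applied. As one possible reading, the sketch below uses plain random oversampling (duplicating minority-class rows at random until the classes are the same size); SMOTE would be another common choice. The function name `random_oversample` and the toy rows are hypothetical; only the class counts (4861 and 249) come from the text.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Toy rows carrying only an index; class sizes follow the counts reported in the text.
labels = [0] * 4861 + [1] * 249
rows = [[i] for i in range(len(labels))]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice, oversampling is usually applied to the training split only, so that duplicated rows do not leak into the test data.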

Feature visualisation
A dataset is composed of data objects, which may be called vectors, points, patterns, occurrences, records, observations, samples, entities, or instances. Data objects are distinguished by a set of characteristics that describe their fundamental properties, like the mass of an object or the time at which an event occurred. To see how the features relate to one another, an urban and rural visualization is displayed. Finally, a correlation heat map of all the features is displayed to give an understanding of the dataset's features and properties. Figure 5 below shows how various aspects are visualized.

Features selection
Feature selection techniques simplify the input variables by selecting only those that are believed to be most helpful in predicting the target variable. To discover the optimal features for our model, we used the two feature selection techniques listed below in this study.

Feature importance gradient boosting and random forest classifier
Procedures that score input features according to how essential (or valuable) they are for the predictive model are referred to as feature importance. Each feature's "importance" during prediction is indicated by its score. A feature with a high score has a greater influence on the model's ability to predict a given variable. By reducing the number of input features, this strategy helps us understand the data and the model better. Using the feature importance property of the predictive model, this approach calculates a score for each dataset feature and eliminates the features with low scores, which streamlines the model and increases performance. Feature importance is also a built-in property of tree-based classifiers.
Gradient boosting classifier. The gradient boosting algorithm was applied to iteratively boost the predictive power of the model. It is an ensemble learning approach that builds a series of weak learners to create a strong predictive model. One of the advantages of gradient boosting is its inherent ability to provide feature importance scores. Here, as shown in Fig. 6, we used a gradient boosting classifier to extract the top 10 features from the data set. BMI, Age, and Average Glucose Level appear to be the most significant features for the gradient boosting classifier.
Random forest classifier. Another technique for ensemble learning is random forest, which builds multiple decision trees during training and aggregates their predictions, returning the mode of the classes for classification tasks or the mean prediction for regression tasks. Similarly to gradient boosting, random forest provides feature importance scores. Feature importance analysis in a random forest classifier identifies and prioritizes features according to how they contribute to the predictive accuracy of the model. Understanding these importance scores offers insight into the key factors influencing the model's decision-making process, which is crucial for interpreting and trusting the model's predictions.
Fig. 7 above represents the feature importance of the random forest classifier. BMI, Age, and Average Glucose Level are the most significant features in the random forest classifier.

Correlation matrix with heatmap
A two-dimensional matrix displaying the correlation coefficients between the features is called a correlation matrix with a heatmap. In order to determine which feature in the data set is most connected to the target variable, one must know the strength of the relationships and how the features are related to one another. Typically, there are several numerical variables in a correlation plot, each represented by a column. The rows in this table show the associations between two variables. The strength of the link is demonstrated by the value of the cell; generally, positive values indicate a positive relationship, while negative values indicate a negative relationship.

The equation provided above illustrates the probability that data point x_i belongs to category 1, where β denotes the model parameters. Following this, the likelihood that a data point belongs to category 0 can be expressed as the complement of that probability. Equations (1) and (2) can be combined to form Eq. (5), which functions as the model for classification and depends on the parameter β. The value of β can be determined by the method of maximum likelihood.
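The passage above describes a binary classifier with parameters β that is fitted by maximum likelihood, which matches the standard logistic regression model. Assuming that is the model meant, a conventional formulation is sketched below. The symbols π_i, y_i, p, and N are assumptions introduced here for illustration, and the equations are left unnumbered so as not to suggest they reproduce the paper's own numbering.

```latex
% Probability that data point x_i belongs to category 1
\pi_i = P(y_i = 1 \mid x_i; \beta) = \frac{1}{1 + e^{-\beta^{\top} x_i}},
\qquad \beta = (\beta_0, \beta_1, \ldots, \beta_p)^{\top}

% Probability that data point x_i belongs to category 0
P(y_i = 0 \mid x_i; \beta) = 1 - \pi_i

% Both cases combined into a single expression
P(y_i \mid x_i; \beta) = \pi_i^{\,y_i} \, (1 - \pi_i)^{\,1 - y_i}

% Maximum likelihood estimate of the parameters
\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N}
  \left[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \right]
```

Here x_i is taken to include a constant 1 as its first component so that β_0 acts as the intercept.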
(2) Decision tree classification
The decision tree classification algorithm 38 handles both regression and classification problems. It is a supervised learning strategy in which the input features are related to the output variable. The model is structured like a tree. The data are repeatedly split according to a particular criterion. A DT algorithm consists of decision nodes and leaf nodes: the data are split at the decision nodes, and the leaf nodes output the final result.
(3) Random forest classification
Random forests 39 are constructed from a large number of independent decision trees, each built on a randomly selected portion of the data set. All of these decision trees are created during training, and each provides a result. The final prediction of this classification system is determined by voting: every decision tree votes for a certain result class, and the random forest chooses the class with the most votes as the final prediction.
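The voting step described above can be illustrated in isolation. The sketch below assumes the individual trees have already produced their class predictions for one sample and only shows how the majority class is chosen; the function name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single patient (1 = stroke, 0 = no stroke).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```

With an odd number of trees and two classes, ties cannot occur; otherwise `most_common` resolves ties in favor of the class that was encountered first.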
The computation of this partition can be carried out using formulation (5).
The symbol p_i denotes the probability that a specific data point is classified as belonging to class i.
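The formula itself is not reproduced in the text, so the exact split criterion is not known. Two standard impurity measures defined in terms of the class probabilities p_i are the Gini impurity, 1 − Σ p_i², and the entropy, −Σ p_i log₂ p_i. The sketch below computes both as an illustration, under the assumption that one of them is what the formulation refers to; all function names are hypothetical.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Estimate p_i as the fraction of samples in the node that belong to class i."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p_i * log2(p_i)."""
    return -sum(p * log2(p) for p in class_probabilities(labels) if p > 0)


# A hypothetical tree node holding two stroke and two non-stroke samples.
node = [1, 1, 0, 0]
print(gini_impurity(node), entropy(node))
```

Both measures are zero for a node that contains a single class and largest when the classes are evenly mixed, so a split is chosen to reduce them as much as possible.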
(4) Support vector machine
The Support Vector Machine (SVM) 40 is an extremely popular supervised learning algorithm used for both classification and regression, although it is primarily designed for classification problems. The objective of SVM is to find an optimal line or decision boundary that separates the n-dimensional space into classes, allowing new data points to be classified easily later on. This optimal decision boundary is called a hyperplane.
Given a training dataset $D = \{(x_i, d_i)\}_{i=0}^{N}$, with $x_i \in \mathbb{R}^n$ and $d_i \in \{-1, 1\}$, the objective is to ascertain the corresponding response $d \in \{-1, 1\}$ for a pattern $x \in \mathbb{R}^n$, where $x \neq x_i$ for every $i$.
Let $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)$. The equation of the separating hyperplane is stated in terms of $\alpha$. Moreover, it is necessary to construct two hyperplanes parallel to the separating hyperplane.
(2) Lazy learning algorithms, such as K-NN, in which all computation is deferred until classification, without any specific pre-processing steps. The feature space is used to determine which training data points are closest to the point being classified. The K-NN classifier predicts the target class using the Euclidean distance metric. The optimal value of k, the parameter that controls the classifier's performance, depends on the dataset, and it is decided after examining its effect on the results. In our experiment, K = 3 was used.
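A minimal sketch of the K-NN rule described above, using the Euclidean distance and k = 3 as stated in the text. It is a plain-Python illustration, not the study's implementation; the function name `knn_predict` and the two-dimensional toy points are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority class among the k training points nearest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated classes.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (7, 8)))
```

Because no model is fitted in advance, all the work happens inside `knn_predict` at query time, which is what makes the method a lazy learner. Features are usually scaled first so that no single feature dominates the distance.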

(6) Gradient boosting classifier
The gradient boosting method 41 is a machine learning method used for various types of problems, including regression and classification. The prediction model consists of a collection of weak predictive models, most often decision trees. When a decision tree is the weak learner, gradient-boosted trees usually perform better than random forests. Gradient-boosted tree models are constructed stage by stage, just as previous boosting approaches were, but they generalize those approaches by allowing the optimization of any differentiable loss function.

(7) XGBoost classifier
XGBoost, or Extreme gradient boosting, is a scalable, distributed machine learning package. It provides parallel tree boosting and is a widely used machine learning library for regression, classification, and ranking.

(A) Evaluating Classifiers
This study assessed the performance and applicability of classifiers using five statistical measures: accuracy, precision, recall/sensitivity, F1 score, and AUC. Each measure is defined below.
• Accuracy demonstrates how well a classification system has performed overall.
• Precision is the proportion of positively predicted examples that are actually positive 42.
• The F-measure represents the link between precision and recall. It always lies closer to the smaller of the two values 43.
• Recall is the proportion of actual positive examples that are correctly identified.
The model performance has been evaluated using a confusion matrix that examines accuracy, precision, recall, and F-measure.
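For reference, the standard definitions of these measures in terms of the confusion matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN) are given below. They are the conventional textbook forms and are left unnumbered.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}

\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

The F-measure is the harmonic mean of precision and recall, which is why it stays close to the smaller of the two.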
When applied to a set of test data, a model's performance is described by a confusion matrix. It reports two types of correct predictions and two types of incorrect predictions made by the classifier 44. Table 2 below describes the confusion matrix of this prediction model. Table 3 gives the confusion matrices for all machine learning models.
The confusion matrix for the random forest classifier, with TP (True Positive) of 476, FN (False Negative) of 14, FP (False Positive) of 0, and TN (True Negative) of 616, indicates an exceptional performance of the model in predicting strokes. The absence of FP instances signifies that the model did not make any incorrect predictions of non-stroke cases as positive, demonstrating high specificity. The substantial TP count reflects the model's effectiveness in accurately identifying positive cases, highlighting its high sensitivity. The small number of FN instances suggests that the model missed only a few actual stroke cases, indicating a strong overall performance. With perfect precision and specificity, a recall of about 97.1% (476/490), and an F1 score of about 98.6%, this random forest model exhibits excellent predictive capabilities. Further analysis and model optimization may focus on maintaining or improving these metrics while considering potential challenges or limitations in the dataset.
The confusion matrix for the gradient boosting classifier reveals 411 True Positives (TP), 79 False Negatives (FN), 36 False Positives (FP), and 580 True Negatives (TN). This implies that the model correctly identified 411 instances where strokes were present (TP), but it failed to detect 79 actual stroke cases (FN). The 36 False Positives are instances where the model incorrectly predicted the occurrence of a stroke. On the positive side, the model accurately identified 580 instances as non-strokes (TN). The sensitivity (recall) of the model, calculated as TP/(TP + FN), is an important metric for stroke prediction, and in this case it is crucial to assess and potentially enhance the model's ability to capture true positive stroke cases. Further refinement and parameter tuning may be considered to improve the overall performance of the gradient boosting classifier for stroke prediction.
The confusion matrix for the Support Vector Classifier (SVC) contains 357 True Positives (TP), 133 False Negatives (FN), 94 False Positives (FP), and 522 True Negatives (TN). True Positives are instances where the model correctly identified the presence of a stroke; here it recognized 357 such instances. However, there were 133 False Negatives, meaning that the model failed to identify that many actual stroke cases. The 94 False Positives are instances where the model incorrectly predicted a stroke. On the positive side, the model correctly identified 522 instances as non-strokes (True Negatives). The sensitivity (recall) of the model, calculated as TP/(TP + FN), is an important metric for stroke prediction, and potential improvements may involve minimizing False Negatives to enhance the model's ability to capture true stroke cases. Further refinement and parameter tuning could contribute to optimizing the Support Vector Classifier for stroke prediction.
The confusion matrix for the KNeighbors Classifier contains 454 True Positives (TP), 36 False Negatives (FN), 0 False Positives (FP), and 616 True Negatives (TN). True Positives are instances where the model correctly identified the presence of a stroke; here it recognized 454 such instances. There were 36 False Negatives, where the model failed to identify actual stroke cases. Importantly, there were 0 False Positives, meaning that the model never predicted a stroke for a non-stroke case. On the positive side, the model correctly identified 616 instances as non-strokes (True Negatives). The sensitivity (recall) of the model, calculated as TP/(TP + FN), is a crucial metric for stroke prediction, and achieving a high True Positive rate with zero False Positives is favorable for the reliability of the KNeighbors Classifier in identifying stroke cases.
The confusion matrix for the Extreme gradient boosting (XGBoost) Classifier contains 462 True Positives (TP), 28 False Negatives (FN), 0 False Positives (FP), and 616 True Negatives (TN). True Positives are instances where the model correctly identified the presence of a stroke; here it recognized 462 such instances. There were 28 False Negatives, where the model failed to identify actual stroke cases. Importantly, there were 0 False Positives, meaning that the model never predicted a stroke for a non-stroke case. On the positive side, the model correctly identified 616 instances as non-strokes (True Negatives). The sensitivity (recall) of the model, calculated as TP/(TP + FN), is a crucial metric for stroke prediction, and achieving a high True Positive rate with zero False Positives is favorable for the reliability of the XGBoost Classifier in identifying stroke cases.
The confusion matrix for the Gaussian Naive Bayes Classifier contains 463 True Positives (TP), 27 False Negatives (FN), 0 False Positives (FP), and 616 True Negatives (TN). True Positives are instances where the model correctly identified the presence of a stroke; here it recognized 463 such instances. There were 27 False Negatives, where the model failed to identify actual stroke cases. Importantly, there were 0 False Positives, meaning that the model never predicted a stroke for a non-stroke case. On the positive side, the model correctly identified 616 instances as non-strokes (True Negatives). The sensitivity (recall) of the model, calculated as TP/(TP + FN), is a crucial metric for stroke prediction, and achieving a high True Positive rate with zero False Positives is favorable for the reliability of the Gaussian Naive Bayes Classifier in identifying stroke cases.
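The metrics discussed above follow directly from the four confusion-matrix counts. The sketch below recomputes them from the counts quoted in this section; the names `ConfusionCounts`, `metrics`, and `REPORTED` are illustrative assumptions and are not taken from the study's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for one model's confusion matrix."""
    tp: int  # stroke cases predicted as stroke
    fn: int  # stroke cases predicted as non-stroke
    fp: int  # non-stroke cases predicted as stroke
    tn: int  # non-stroke cases predicted as non-stroke


def metrics(c: ConfusionCounts) -> dict:
    """Derive accuracy, precision, recall, specificity, and F1 from counts."""
    total = c.tp + c.fn + c.fp + c.tn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    specificity = c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Counts as quoted in the text of this section.
REPORTED = {
    "Random forest": ConfusionCounts(tp=476, fn=14, fp=0, tn=616),
    "Gradient boosting": ConfusionCounts(tp=411, fn=79, fp=36, tn=580),
    "SVC": ConfusionCounts(tp=357, fn=133, fp=94, tn=522),
    "KNeighbors": ConfusionCounts(tp=454, fn=36, fp=0, tn=616),
    "XGBoost": ConfusionCounts(tp=462, fn=28, fp=0, tn=616),
    "Gaussian Naive Bayes": ConfusionCounts(tp=463, fn=27, fp=0, tn=616),
}

if __name__ == "__main__":
    for name, counts in REPORTED.items():
        m = metrics(counts)
        row = "  ".join(f"{key}={value:.4f}" for key, value in m.items())
        print(f"{name:22s} {row}")
```

For the random forest counts, this gives a precision and specificity of 1.0 and a recall of 476/490, or about 0.971, since zero false positives do not imply zero false negatives.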
(B) Results
As described in Section "Materials and methods", this study used the dataset described in that section to conduct an experiment. Of the dataset, 80% was used for training and 20% for testing. We used gradient boosting and random forest for feature selection, and the following classifiers to classify stroke patients: the decision tree classifier, Support Vector Machine (SVM) classifier, logistic regression classifier, gradient boosting classifier, random forest classifier, K neighbors classifier, and Xtreme gradient boosting classifier. Table 4 lists the developed classification models with their accuracy, precision, recall, and F1-score values. Figure 8 shows the comparison graph for the evaluation metrics. To develop the classification models, 19 stroke attributes were used as input variables and two target stroke values as output variables. Based on the accuracy equation above, random forest surpasses the other classifiers with an accuracy of 98.71%, the highest among the machine learning algorithms evaluated. Table 4 provides a comparison of the machine learning strategies, and Fig. 9 presents a chart demonstrating the RF method's advantage over the other ML methods based on performance metrics calculated with Eqs. (1)-(4).
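The 80/20 division of the dataset described above can be sketched as a shuffled split. This is only an illustration under stated assumptions: the helper `train_test_split`, the fixed seed, and the synthetic rows with 19 features and a binary label are hypothetical, and the study does not say whether its split was stratified.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(rows: Sequence[T], test_fraction: float = 0.2,
                     seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle row indices and hold out `test_fraction` of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_indices]
    test = [rows[i] for i in indices[:n_test]]
    return train, test


def make_toy_rows(n: int, n_features: int = 19, seed: int = 0):
    """Synthetic stand-in data: (feature vector, binary label) pairs."""
    rng = random.Random(seed)
    return [([rng.random() for _ in range(n_features)], rng.randint(0, 1))
            for _ in range(n)]


if __name__ == "__main__":
    rows = make_toy_rows(100)
    train_rows, test_rows = train_test_split(rows, test_fraction=0.2)
    print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fixing the seed keeps the held-out rows the same across runs, so that every classifier is compared on an identical test set.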
The area under the ROC curve (AUC-ROC) shows the performance of a classifier at various thresholds. The ROC is a probability curve, and the AUC measures the degree of separability: it reveals how well the model can distinguish between classes. The AUC equals the probability that the classifier scores a randomly chosen positive instance higher than a randomly chosen negative instance. Because of this, the AUC is frequently regarded as a superior metric to the classification error rate. Fig. 10 provides the area under the ROC curve for the classification model comparison with the random forest technique. Figure 11 shows the comparison of training and testing accuracy for all models. The RF algorithm achieves the best results among the algorithms compared.
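The probabilistic reading of the AUC given above can be checked directly by comparing every positive score with every negative score. The sketch below does this on made-up scores; the function name `auc_by_ranking` and the example values are assumptions for illustration and are unrelated to the study's results.

```python
from typing import Sequence


def auc_by_ranking(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical predicted stroke probabilities for six patients.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
    print(f"AUC = {auc_by_ranking(labels, scores):.3f}")
```

In this toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of samples and is meant to show the definition rather than to be efficient.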

Conclusion
Because stroke is a potentially fatal medical condition, its treatment is essential to prevent further complications. To mitigate the severe consequences of stroke, an ML model may help with early stroke detection. In this study, various machine learning algorithms are investigated for their ability to accurately predict strokes based on physiological factors. Among the approaches examined, the random forest approach stands out due to its 98% classification accuracy. According to this study, the random forest approach performs better than existing stroke prediction techniques. In the future, this research could be extended to a larger dataset and to other machine learning techniques, including AdaBoost and Bagging. As a result, both the framework's dependability and its presentation will be improved. With the help of a machine learning architecture, members of the public may be able to estimate their risk of developing a stroke as adults in exchange for providing some basic information. As a result, patients would be able to receive early stroke treatment and recover in a timely manner.

https://doi.org/10.1038/s41598-024-70354-1

This suggested system has the following six phases: (1) importing a dataset of strokes; (2) preprocessing of the data; (3) data splitting; (4) feature selection; (5) classifiers; and (6) classifier evaluation.

$P_2: \alpha^{T} x + b = -1$ (10)

Scientific Reports | (2024) 14:20053 | https://doi.org/10.1038/s41598-024-70354-1

(5) K-Nearest Neighbors (KNN)

The table below represents the confusion matrix for each model.

The logistic regression confusion matrix, with TP (True Positive) of 357, FN (False Negative) of 133, FP (False Positive) of 90, and TN (True Negative) of 526, provides insight into the model's performance in predicting strokes. The TP value indicates that the model correctly identifies most positive cases, while the FN and FP values highlight areas for improvement. The 133 FN instances are cases where the model failed to detect actual strokes, indicating a need for increased sensitivity. Similarly, the 90 FP instances are cases where the model incorrectly predicted strokes, signaling a need for enhanced specificity. The accuracy, precision, recall, specificity, and F1 score derived from these values offer a nuanced assessment of the model's overall correctness, its ability to identify true positives, and the balance between precision and recall. Careful consideration of these metrics is crucial for refining the model's predictive capabilities and optimizing its performance in stroke prediction.

The confusion matrix for the decision tree classifier, with TP (True Positive) of 423, FN (False Negative) of 67, FP (False Positive) of 0, and TN (True Negative) of 616, indicates robust performance in predicting strokes. The absence of FP instances means that the model did not predict a stroke for any non-stroke case, demonstrating high specificity. The substantial TP count reflects the model's effectiveness in identifying positive cases. The 67 FN instances are cases where the model failed to detect actual strokes, suggesting room for improvement in sensitivity. Overall, the model exhibits a strong balance between precision and recall, with high accuracy, precision, recall, specificity, and F1 score. Further analysis and fine-tuning may focus on minimizing false negatives while maintaining the model's excellent performance in correctly identifying positive cases.

Figure 9. Comparison chart of evaluation metrics.

Figure 11. Comparison of training and testing accuracy of all models.

Table 1. Comparison table of related work.

Table 4. Comparison of machine learning approaches.